ABSTRACT

The nature of the criterion (dependent) variable may play a useful
role in structuring a list of classification/prediction problems.
Such criteria are continuous in nature, binary (dichotomous), or
multichotomous. In this paper, discussion is limited to the
continuous, normally distributed criterion and the binary criterion
scenarios. For both cases, it is assumed that the predictor variables
are continuous multivariate normal. For the binary variable case, the
multivariate normal assumption is conditioned on the binary criterion;
that is, for each value of the binary criterion, the predictors are
multivariate normal with a common covariance matrix but different
centroids. In other words, for the continuous criterion case, the
correlations model is used, while for the binary case the assumptions
associated with the classic two-group discriminant analysis problem
are employed. When these two models fit some population of data, the
use of standard loss functions yields well-known population-optimal
solutions. A unified framework for classification and prediction
problems is presented. Well-known and lesser-known relationships among
correlations, distances and error rates are established. A new
population distance, the shrunken generalized distance, and a new
estimator of the actual error rate are introduced. (Author/PM)
Abstract

A unified framework for classification and prediction problems is
presented. Well-known and lesser-known relationships among correlations,
distances and error rates are established. A new population distance, the
shrunken generalized distance, and a new estimator of the actual error rate
are introduced.



Classification and prediction problems abound. An extensive list of 
prediction and classification examples is easy to generate. Such a list 
could be structured by searching for similarities and identifying differ- 
ences among the examples on it. Ultimately, each entry on the list could 
be viewed as a member of one of a smaller set of classes of prediction/ 
classification problems. 

The nature of the criterion (dependent) variable may play a useful role
in structuring a list of classification/prediction problems. Some criteria
are essentially continuous in nature, e.g., scores on a long test. Other
criteria are binary, e.g., group membership. Other criteria appear binary
but may be thought of as dichotomizations of a continuous underlying
criterion, e.g., pass/fail grading of an essay. Other criteria are
multichotomous. In this paper, discussion is limited to the continuous,
normally distributed criterion and the truly binary criterion scenarios.
For both cases, it will be assumed that the predictor variables are
continuous multivariate normal. For the binary variable case, the
multivariate normal assumption is conditioned on the binary criterion,
i.e., for each value of the binary criterion, the predictors are
multivariate normal with a common covariance matrix but different
centroids. In other words, for the continuous criterion case, the
correlations model is used, while for the binary case the assumptions
associated with the classic two-group discriminant analysis problem are
employed.

When these two models fit some population of data, the use of
standard loss functions (least squares in the continuous criterion case;
maximum probability in the binary criterion case) yields well-known
population-optimal solutions.
Population Indices

The Continuous Criterion Case 

For the continuous case, ordinary least squares regression yields the
population-optimal regression equation,

(1)    $\beta_p = \Sigma_{xx}^{-1} \sigma_{xy}$ ,

where $\Sigma_{xx}$ is the r-by-r population covariance matrix among the
predictors (X), and $\sigma_{xy}$ is the r-by-1 vector of covariances
between the predictors and criterion y. The population multiple
correlation, $\rho_p$, or validity coefficient,

(2)    $\rho_p = (\beta_p' \sigma_{xy}) / (\beta_p' \Sigma_{xx} \beta_p \, \sigma_y^2)^{1/2}$ ,

indexes the extent to which the predicted criterion orderings,

(3)    $\hat{y}_p(x_i) = \beta_p' x_i + \alpha_p$ ,

obtained by applying the regression weights to the r predictor scores for
the individual ($x_i$), match the ordering of the criterion scores in the
population. The population mean squared error, $MSE_p$, indexes how
accurately the predicted criterion scores match the actual criterion scores
in the population,

(4)    $MSE_p = \mathcal{E}(y - \hat{y}_p)^2$ .

The population squared validity and mean squared error of prediction are
related via

(5)    $\rho_p^2 = 1 - MSE_p / \sigma_y^2$ .
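A small numerical sketch may help make relations (1), (2), (4) and (5) concrete. The covariance values are hypothetical and chosen only for illustration, and the helper names (inv2, matvec, dot) are not from the paper. The mean squared error in (4) is expanded as $\sigma_y^2 - 2\beta_p'\sigma_{xy} + \beta_p'\Sigma_{xx}\beta_p$, which assumes the predictions are centered on the criterion mean.

```python
# Hypothetical population values for r = 2 predictors (illustrative only).
S_xx = [[1.0, 0.3],
        [0.3, 1.0]]      # Sigma_xx, covariance matrix among the predictors
s_xy = [0.5, 0.4]        # sigma_xy, predictor-criterion covariances
var_y = 1.0              # sigma_y^2, criterion variance


def inv2(m):
    """Invert a 2-by-2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


beta = matvec(inv2(S_xx), s_xy)                       # equation (1)
explained = dot(beta, matvec(S_xx, beta))             # beta' Sigma_xx beta
rho = dot(beta, s_xy) / (explained * var_y) ** 0.5    # equation (2)
mse = var_y - 2 * dot(beta, s_xy) + explained         # equation (4), expanded

print(rho ** 2, 1 - mse / var_y)                      # the two sides of (5)
```

With the optimal weights the two printed quantities agree to rounding error, as (5) requires.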

The Binary Criterion Case 

In the standard two-subpopulation classification case in which the
subpopulations, $K_1$ and $K_2$, are of equal size, i.e.,
$pr(K_1) = pr(K_2) = .5$, and the r predictors in both subpopulations
follow a multivariate normal distribution with the same covariance matrix,
$\Sigma$, and different centroids, $\mu_1$ and $\mu_2$, optimal
classification according to the maximum probability, maximum likelihood,
and generalized distance rules (Huberty, 1975; Tatsuoka, 1971) all reduce
to assignment to the subpopulation with the nearest centroid.
Operationally, this is accomplished by computing the Wald-Anderson
classification statistic

(6)    $W_p(x_i) = \lambda_p' [x_i - .5(\mu_1 + \mu_2)]$ ,

where $\lambda_p$ is the r-by-1 vector containing Fisher's (1936)
population linear discriminant weights,

(7)    $\lambda_p = \Sigma^{-1} (\mu_1 - \mu_2)$ .

The adequacy of classification in the population is indexed by the
optimal error rate (Hills, 1966),

(8)    $E_p = .5\,\Phi\!\left[ \dfrac{-W_p(\mu_1)}{(\lambda_p' \Sigma \lambda_p)^{1/2}} \right] + .5\,\Phi\!\left[ \dfrac{W_p(\mu_2)}{(\lambda_p' \Sigma \lambda_p)^{1/2}} \right]$ ,

which is the probability of misclassification associated with use of the
population-optimal classification rule, $W_p$. In (8), $\Phi$ is the
standard normal distribution function. It has been shown that $E_p$ can be
expressed in terms of the separation between $K_1$ and $K_2$, the
population Mahalanobis (1936) or generalized distance,



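(9)    $\delta_p^2 = (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2)$ ,

which can be thought of as the squared standardized difference between
populations $K_1$ and $K_2$ along the dimension defined by $\lambda_p$,

(10)   $\delta_p^2 = [\lambda_p' (\mu_1 - \mu_2)]^2 / (\lambda_p' \Sigma \lambda_p)$ .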
In particular, $E_p$ can be expressed as a function of $\delta_p$,

(11)   $E_p = .5\,\Phi\!\left[ \dfrac{-.5\,\delta_p^2}{\delta_p} \right] + .5\,\Phi\!\left[ \dfrac{-.5\,\delta_p^2}{\delta_p} \right] = \Phi(-.5\,\delta_p)$ ,

which can be obtained by evaluating (6) at $\mu_1$ and $\mu_2$ and
simplifying the expression using (9) and (10).
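The reduction of the optimal error rate to a function of the generalized distance can be checked numerically. The sketch below assumes a hypothetical two-predictor population (the covariance matrix and centroids are illustrative values, not from the paper) and compares the error rate computed from the classification statistic with $\Phi(-.5\,\delta_p)$.

```python
from statistics import NormalDist

Phi = NormalDist().cdf    # standard normal distribution function

# Hypothetical population (illustrative values only).
Sigma = [[1.0, 0.5],
         [0.5, 1.0]]      # common within-subpopulation covariance matrix
mu1 = [1.0, 0.5]          # centroid of K1
mu2 = [0.0, 0.0]          # centroid of K2


def inv2(m):
    """Invert a 2-by-2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


diff = [a - b for a, b in zip(mu1, mu2)]
lam = matvec(inv2(Sigma), diff)               # Fisher weights, Sigma^{-1}(mu1 - mu2)
delta2 = dot(diff, lam)                       # squared generalized distance
midpoint = [(a + b) / 2 for a, b in zip(mu1, mu2)]


def W(x):
    """Wald-Anderson statistic with the cutoff at the centroid midpoint."""
    return dot(lam, [xi - mi for xi, mi in zip(x, midpoint)])


sd = dot(lam, matvec(Sigma, lam)) ** 0.5      # within-group s.d. of the composite
E_from_W = 0.5 * Phi(-W(mu1) / sd) + 0.5 * Phi(W(mu2) / sd)
E_from_delta = Phi(-0.5 * delta2 ** 0.5)
print(E_from_W, E_from_delta)
```

For these values the weights come out as (1, 0), the squared distance is 1, and both routes give the same error rate.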



Parallels Between Continuous Criterion and Binary Criterion Cases

There are parallels between the continuous criterion case and the binary
criterion case. There are parallel sets of weights: $\beta_p$ for the
continuous criterion case, $\lambda_p$ for the binary criterion case. The
squared correlation measure of association, $\rho_p^2$, parallels the
generalized distance, $\delta_p^2$. And the mean squared error of
prediction, $MSE_p$, parallels the optimal error rate, $E_p$. In fact, for
the binary criterion case, $\beta_p$ and $\lambda_p$ are known to be
proportional. In addition, it can be shown that $\delta_p^2$ and
$\rho_p^2$ are related (see Appendix A),

(12)   $\delta_p^2 = \dfrac{\rho_p^2}{(1 - \rho_p^2)\,[pr(K_1) \cdot pr(K_2)]}$ .
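The link between the squared multiple correlation and the generalized distance can be verified on a small example. The sketch below is only an illustration under assumed values: the proportions, within-group covariance matrix and centroid difference are hypothetical, and the forms used for the total covariance matrix and the predictor-criterion covariances of a 0/1 criterion are those given in Appendix A.

```python
q1, q2 = 0.3, 0.7                    # hypothetical subpopulation proportions
Sigma = [[1.0, 0.2],
         [0.2, 2.0]]                 # hypothetical within-group covariance matrix
d = [1.0, -0.5]                      # hypothetical centroid difference mu1 - mu2


def inv2(m):
    """Invert a 2-by-2 matrix given as nested lists."""
    (a, b), (c, e) = m
    det = a * e - b * c
    return [[e / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


c = q1 * q2
# Quantities for a 0/1 group-membership criterion (forms from Appendix A).
S_xx = [[Sigma[i][j] + c * d[i] * d[j] for j in range(2)] for i in range(2)]
s_xy = [c * di for di in d]
var_y = c

delta2 = dot(d, matvec(inv2(Sigma), d))               # generalized distance
rho2 = dot(s_xy, matvec(inv2(S_xx), s_xy)) / var_y    # squared multiple correlation
delta2_from_rho = rho2 / ((1.0 - rho2) * q1 * q2)     # right-hand side of (12)
print(delta2, delta2_from_rho)
```

The two printed values coincide, which is the content of (12).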



Cross-Validity and the Actual Error Rate

In practice, we seldom work with populations. Instead, we are limited
to samples of data. Substitution of sample means, variances and covariances
into (1) - (12) produces sample analogues of $\beta_p$, $\lambda_p$,
$\rho_p$, $\delta_p^2$, $MSE_p$ and $E_p$. For example, for the continuous
criterion case, we have

(13)   $b_s = C_{xx}^{-1} c_{xy}$ ,
where $C_{xx}$ and $c_{xy}$ are the sample analogues of $\Sigma_{xx}$ and
$\sigma_{xy}$. For the binary criterion case, we have

(14)   $l_s = G^{-1} (\bar{x}_1 - \bar{x}_2)$ ,

where $G$, $\bar{x}_1$ and $\bar{x}_2$ are the sample within-group
covariance matrix and sample centroids, respectively. In general, the
usefulness of a regression equation or a classification rule should be
assessed by its performance in the population, not its performance in the
sample. For the continuous criterion case, the population cross-validity
coefficient,

(15)   $R_c = (b_s' \sigma_{xy}) / (b_s' \Sigma_{xx} b_s \, \sigma_y^2)^{1/2}$ ,

and the mean squared error of prediction, $MSE_c$, associated with use of
the sample weights, $b_s$, in the population index the long-term usefulness
of the sample weights. Lord (1950) developed an estimator for the $MSE_c$
for when the predictors are considered fixed, i.e., the regression model,
while Stein (1960) developed an estimator for the $MSE_c$ under the
correlation model. Browne (1975), as demonstrated by Drasgow, Dorans and
Tucker (1979) and Drasgow and Dorans (1982), developed an estimator of the
population squared cross-validity that is virtually unbiased and robust to
violations of multivariate normality in the predictors.

For the binary criterion case, the actual error rate, $E_c$, summarizes
how well a sample classification rule works in the population. The actual
error rate is the probability of misclassification associated with use of
the sample classification rule in the population. In many ways, the actual
error rate is more important than the optimal error rate. The actual error
rate is akin to the population mean squared error rate associated with a
sample regression equation. In the two equal-sized subpopulation case
under consideration, the expression for the actual error rate is
(16)   $E_c = .5\,\Phi\!\left[ \dfrac{-W_s(\mu_1)}{V_w^{1/2}} \right] + .5\,\Phi\!\left[ \dfrac{W_s(\mu_2)}{V_w^{1/2}} \right]$ ,

where $W_s(\mu_k)$ is the sample classification statistic $W_s(x_i)$
evaluated at $\mu_k$,

(17)   $W_s(x_i) = l_s' [x_i - .5(\bar{x}_1 + \bar{x}_2)]$ ,

and $V_w$ is the variance in each subpopulation of the composite defined by
the sample discriminant weights, $l_s$,

(18)   $V_w = l_s' \Sigma \, l_s$ .



The literature contains several estimators of the actual error rate for
the two multivariate normal subpopulation case. One class of estimators,
which is somewhat heuristic, comprises the distance-modification
estimators. This class of estimators attempts to mimic the relationship
between $E_p$ and $\delta_p^2$ stated in (11) by substituting distance
estimates into

(19)   $\hat{E}_c = \Phi[(-.5\,D)]$ .

Two of the most popular distance-modification methods are the D-method
and the DS-method. The D-method uses the sample generalized distance $D_s$
for $D$ in (19). The DS-method uses

(20)   $D_{DS}^2 = (N - r - 3)\,D_s^2 / (N - 2)$ ,

which is the positive portion of an unbiased estimate of $\delta_p^2$,

(21)   $\hat{\delta}_p^2 = D_{DS}^2 - N r / (N/2)^2$ .

According to Lachenbruch and Mickey (1968), $D_{DS}^2$ is used instead of
$\hat{\delta}_p^2$ to avoid negative distance estimates.
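The D-method and DS-method estimates can be sketched in a few lines. The function names and the input values are hypothetical, and setting a negative squared distance to zero before taking the square root is a choice made for this sketch, not something the text prescribes.

```python
from statistics import NormalDist

Phi = NormalDist().cdf    # standard normal distribution function


def ds_distance_sq(D2, N, r):
    """Equation (20): the DS modification of the sample distance D_s^2."""
    return (N - r - 3) * D2 / (N - 2)


def unbiased_distance_sq(D2, N, r):
    """Equation (21), equal-N case (N1 = N2 = N/2); may be negative."""
    return ds_distance_sq(D2, N, r) - N * r / (N / 2) ** 2


def error_from_distance(dist_sq):
    """Equation (19) applied to a squared distance estimate."""
    # Assumption of this sketch: a negative squared distance is treated as 0.
    return Phi(-0.5 * max(dist_sq, 0.0) ** 0.5)


D2, N, r = 2.0, 40, 4                    # hypothetical D_s^2, total N, predictors
d_method = error_from_distance(D2)
ds_method = error_from_distance(ds_distance_sq(D2, N, r))
print(d_method, ds_method)
```

Because the DS distance is smaller than the raw sample distance, the DS-method reports the larger (less optimistic) error rate of the two.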
The Shrunken Generalized Distance

While attracted by the intuitive appeal of these two distance-
modification procedures, I am convinced that they are inappropriate, i.e.,
not the right distances. $D_s^2$ is like the sample squared multiple
correlation, $R_s^2$; in fact, they can be related. Using a positively
biased estimate of the population Mahalanobis distance, as the DS-method
does, is like using a positively biased estimate of the population squared
multiple correlation, $\rho_p^2$, to estimate the population squared
cross-validity coefficient, $R_c^2$. An estimate of some distance that was
analogous to the squared cross-validity is clearly needed. So I invented
(Dorans, 1979) the shrunken generalized distance, $D_c^2$, between two
subpopulation centroids, $\mu_1$ and $\mu_2$.

The shrunken generalized distance is the squared standardized distance
between the projections of the two subpopulation centroids onto the
dimension defined by the sample discriminant weights, $l_s$. These
projections are obtained by evaluating the sample classification statistic
in (17) at $\mu_1$ and $\mu_2$. The variance along this dimension is that
defined in (18). The shrunken generalized distance is formally expressed as

(22)   $D_c^2 = (W_s(\mu_1) - W_s(\mu_2))^2 / V_w$ ,

which can be rewritten as

(23)   $D_c^2 = [l_s' (\mu_1 - \mu_2)]^2 / (l_s' \Sigma \, l_s)$ .

2 

To appreciate what $D_c^2$ represents, it is helpful to resort to
geometric imagery. For the case of two multivariate normal subpopulations
with equal covariance matrices and different centroids, the population
discriminant weights define the dimension in the r-dimensional predictor
space along which there is maximal separation between the subpopulations.
As noted earlier, the population generalized distance, $\delta_p^2$, can be
thought of as the squared standardized distance between the subpopulation
centroids' projections on this dimension defined by $\lambda_p$ (see
equation 10).

Suppose that instead of $\lambda_p$, we had used the sample weights $l_s$
to define a dimension in the population. When the centroids of the two
subpopulations are projected onto this dimension, two means are produced,
one for each subpopulation on this dimension. The squared standardized
difference between these means is the shrunken generalized distance. Unless
the dimension defined by the sample weights is parallel or collinear to the
dimension defined by the population-optimal weights, this squared
standardized difference in means will be smaller than the population
generalized distance. In other words, the distance will have shrunken;
hence, the phrase shrunken generalized distance.
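The geometric argument can be illustrated numerically: projecting the centroids onto any direction other than the population-optimal one gives a smaller squared standardized separation. The population values and the "sample" weight vector below are hypothetical, and the helper names are not from the paper.

```python
# Hypothetical population (illustrative values only).
Sigma = [[1.0, 0.5],
         [0.5, 1.0]]      # common within-subpopulation covariance matrix
d = [1.0, 0.5]            # centroid difference mu1 - mu2


def inv2(m):
    """Invert a 2-by-2 matrix given as nested lists."""
    (a, b), (c, e) = m
    det = a * e - b * c
    return [[e / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def projected_distance_sq(w):
    """Squared standardized centroid separation along weights w:
    (w'd)^2 / (w' Sigma w)."""
    return dot(w, d) ** 2 / dot(w, matvec(Sigma, w))


lam = matvec(inv2(Sigma), d)              # population-optimal weights
delta2 = projected_distance_sq(lam)       # population generalized distance
l_s = [0.8, 0.3]                          # hypothetical sample weights
Dc2 = projected_distance_sq(l_s)          # shrunken generalized distance
print(delta2, Dc2)
```

Rescaling the weight vector leaves the projected distance unchanged, so only the direction of the sample weights matters.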

This shrunken generalized distance should estimate the actual error
rate better than the modified distances used by the D-method and the
DS-method. An estimator of the shrunken generalized distance was derived
(see Appendix B),

(24)   $\hat{D}_c^2 = \dfrac{\hat{\delta}_p^2 \, [(N-3) + N_1 N_2 N^{-1} (N-r-2)\, \hat{\delta}_p^2]}{(N-3)\,(r + N_1 N_2 N^{-1} \hat{\delta}_p^2)}$ ,

which uses the unbiased estimator of the population generalized distance
defined in (21) and where $N_1$ and $N_2$ are the sample sizes for each
subgroup, i.e., $N = N_1 + N_2$. A simulation was conducted to compare this
new shrunken distance estimator with the two other distance-modification
estimators, as well as five other estimators. I expected the MU-estimator,
as I called it, to be superior to the D-method and DS-method because it
used the appropriate distance, the shrunken generalized distance.
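A minimal sketch of the shrunken distance estimator follows. The function names and input values are hypothetical, and clamping a negative distance estimate at zero is an assumption of the sketch; the error-rate step simply substitutes the shrunken distance into the form of (19) for the equal-size case.

```python
from statistics import NormalDist


def shrunken_distance_sq(delta2, N1, N2, r):
    """Equation (24): estimated shrunken generalized distance."""
    N = N1 + N2
    c = N1 * N2 / N
    numerator = delta2 * ((N - 3) + c * (N - r - 2) * delta2)
    denominator = (N - 3) * (r + c * delta2)
    return numerator / denominator


def mu_error_rate(delta2_hat, N1, N2, r):
    """Substitute the shrunken distance into the form of equation (19)."""
    # Assumption of this sketch: a negative distance estimate is set to 0.
    d2 = shrunken_distance_sq(max(delta2_hat, 0.0), N1, N2, r)
    return NormalDist().cdf(-0.5 * d2 ** 0.5)


one_predictor = shrunken_distance_sq(1.5, 20, 20, 1)    # no shrinkage when r = 1
four_predictors = shrunken_distance_sq(1.5, 20, 20, 4)  # shrinks when r > 1
print(one_predictor, four_predictors, mu_error_rate(1.5, 20, 20, 4))
```

With a single predictor the estimate equals the input distance; with more predictors it is smaller, so the estimated error rate is larger.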

One of the five other estimators examined in the simulation is the
OS-method, which is based on Okamoto's (1963) asymptotic expansion of the
distribution of the sample Wald-Anderson statistic, $W_s$. Previous
research (Lachenbruch and Mickey, 1968; Sorum, 1972) had demonstrated that
the OS-method was the best estimator available. The equal N special case of
Okamoto's OS-method was used,

(25)   $E_c(OS) = \Phi(-.5\,D_{DS}) + \phi(-.5\,D_{DS}) \left[ \dfrac{(r-1)}{N D_{DS}} + \dfrac{D_{DS}}{4N} + \dfrac{(r-1)\,D_{DS}}{4(N-2)} \right]$ .

In (25), $\phi$ is the standard normal density function.
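The OS-method can be sketched as a leading normal-theory term plus a small-sample correction. The function below follows the three-term equal-N form of equation (25) as it is printed here; the function name and the input values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def os_error_rate(D_ds, N, r):
    """Equal-N form of equation (25) as printed in this section."""
    correction = (
        (r - 1) / (N * D_ds)
        + D_ds / (4 * N)
        + (r - 1) * D_ds / (4 * (N - 2))
    )
    return STD_NORMAL.cdf(-0.5 * D_ds) + STD_NORMAL.pdf(-0.5 * D_ds) * correction


leading_term = STD_NORMAL.cdf(-0.5 * 1.3)     # hypothetical D_DS = 1.3
small_sample = os_error_rate(1.3, 40, 4)
large_sample = os_error_rate(1.3, 4000, 4)
print(leading_term, small_sample, large_sample)
```

The correction is positive and vanishes as N grows, so the estimate approaches the leading term from above.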

The simulation study (Dorans, 1984) demonstrated that the MU-method is
the best of the heuristic distance-modification procedures. In addition,
it seemed to perform as well as, if not better than, the OS-method. The
MU-method works well because it is an estimator of the minimum actual error
rate associated with use of the sample classification rule in the
population. (See Appendix C for proof of this statement.)

The Shrunken Generalized Distance and the Squared Cross-Validity

In Appendix A, it is demonstrated that the population parameters
$\rho_p^2$ and $\delta_p^2$ are related as in equation (12). In Appendix D,
the relationship between the shrunken generalized distance, $D_c^2$, and
the squared cross-validity coefficient, $R_c^2$, is shown to be

(26)   $D_c^2 = \dfrac{R_c^2}{(1 - R_c^2)\,(q_1 q_2)}$ ,

where $q_1$ and $q_2$ are the relative sizes of subpopulations $K_1$ and
$K_2$. This relationship between $D_c^2$ and $R_c^2$ completes the unified
framework for classification and prediction problems.
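Relationship (26) can be checked numerically for an arbitrary weight vector. The sketch assumes that the same (or proportional) weights define both the cross-validity coefficient and the shrunken distance, and it uses the covariance structure of a 0/1 group-membership criterion from Appendix A; all numerical values are hypothetical.

```python
q1, q2 = 0.4, 0.6                    # hypothetical subpopulation proportions
Sigma = [[1.0, 0.3],
         [0.3, 1.5]]                 # hypothetical within-group covariance matrix
d = [0.9, 0.4]                       # hypothetical centroid difference mu1 - mu2
w = [0.7, -0.2]                      # hypothetical sample weights


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


c = q1 * q2
# Total covariance matrix and criterion covariances for a 0/1 criterion.
S_xx = [[Sigma[i][j] + c * d[i] * d[j] for j in range(2)] for i in range(2)]
s_xy = [c * di for di in d]
var_y = c

Dc2 = dot(w, d) ** 2 / dot(w, matvec(Sigma, w))               # equation (23)
Rc2 = dot(w, s_xy) ** 2 / (dot(w, matvec(S_xx, w)) * var_y)   # square of (15)
Dc2_from_Rc2 = Rc2 / ((1.0 - Rc2) * q1 * q2)                  # equation (26)
print(Dc2, Dc2_from_Rc2)
```

The agreement holds for any nonzero weight vector, which is why the relation parallels (12) for the optimal weights.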

The framework distinguishes between continuous criterion cases and
truly binary criterion cases. On the continuous side of the ledger we have
$\beta_p$, $\rho_p$ and $MSE_p$, with (2), (3) and (5) serving as
definitions and establishing relationships. On the binary side we have the
analogous $\lambda_p$, $\delta_p^2$ and $E_p$, with (7), (8), (9) and (10)
serving as defining relationships. Then Appendix A demonstrates that
$\delta_p^2$ and $\rho_p^2$ are related as in (12).

The framework includes the use of sample weights in the population.
For the continuous criterion case, we have $R_c^2$ and $MSE_c$. For the
binary criterion case, we have $D_c^2$ and $E_c$. Appendix D establishes
the relationship between $R_c^2$ and $D_c^2$, while Appendix C shows how
$D_c^2$ and $E_c$ may be related.

In order to complete the framework for prediction/classification
problems, the notion of the shrunken generalized distance, $D_c^2$, was
introduced. In addition to being the missing piece in the analytic
framework, this distance is useful for estimating the actual error rate, as
demonstrated elsewhere (Dorans, 1984).
APPENDIX A

RELATIONSHIP BETWEEN $\rho_p^2$ AND $\delta_p^2$

In general, let $\Gamma$ be an (r+1)-by-(r+1) covariance matrix having the
form:

(A.1)   $\Gamma = \begin{bmatrix} \Sigma_{xx} & \sigma_{xy} \\ \sigma_{yx} & \sigma_y^2 \end{bmatrix}$

where $\sigma_{yx}$ is a 1-by-r vector of covariances for the criterion
variable, Y, with each of the r predictor variables X, $\Sigma_{xx}$ is the
intercovariance matrix among the r predictors, and $\sigma_y^2$ is the
variance of the criterion. When Y is a binary variable representing group
membership, taking the value 1 if an individual is from subpopulation
$K_1$, and the value 0 if an individual is from subpopulation $K_2$,
$\sigma_y^2$ and $\sigma_{yx}$ take on special forms. In particular,
$\sigma_y^2$ is defined as the product

(A.2)   $\sigma_y^2 = q_1 q_2$ ,

where $q_1$ and $q_2$ are the proportions of individuals in $K_1$ and
$K_2$, respectively. The covariance vector takes on the form

(A.3)   $\sigma_{xy} = q_1 (1)(\mu_1 - \mu) + q_2 (0)(\mu_2 - \mu)$
        $\phantom{\sigma_{xy}} = q_1 (\mu_1 - q_1 \mu_1 - q_2 \mu_2)$
        $\phantom{\sigma_{xy}} = q_1 q_2 (\mu_1 - \mu_2)$

where $\mu_1$ is the r-by-1 centroid vector in $K_1$ while $\mu_2$ is the
r-by-1 centroid vector in $K_2$, and $\mu$ is the grand mean

(A.4)   $\mu = q_1 \mu_1 + q_2 \mu_2$ .

Therefore, when group membership is coded as a binary variable, the general
expression for $\Gamma$ in (A.1) has the form

(A.5)   $\Gamma = \begin{bmatrix} \Sigma_{xx} & q_1 q_2 (\mu_1 - \mu_2) \\ q_1 q_2 (\mu_1 - \mu_2)' & q_1 q_2 \end{bmatrix}$ .
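In general, the population least squares regression weights $\beta_p$ are
defined as

(A.6)   $\beta_p = \Sigma_{xx}^{-1} \sigma_{xy}$ ,

which, for a binary criterion variable, reduces to

(A.7)   $\beta_p = q_1 q_2 \, \Sigma_{xx}^{-1} (\mu_1 - \mu_2)$ .

The population squared multiple correlation is defined as

(A.8)   $\rho_p^2 = (\beta_p' \sigma_{xy})^2 / (\sigma_y^2 \, \beta_p' \Sigma_{xx} \beta_p)$
        $\phantom{\rho_p^2} = \dfrac{(q_1 q_2 (\mu_1 - \mu_2)' \Sigma_{xx}^{-1} (\mu_1 - \mu_2)\, q_1 q_2)^2}{q_1 q_2 \cdot q_1 q_2 (\mu_1 - \mu_2)' \Sigma_{xx}^{-1} \Sigma_{xx} \Sigma_{xx}^{-1} (\mu_1 - \mu_2)\, q_1 q_2}$
        $\phantom{\rho_p^2} = q_1 q_2 (\mu_1 - \mu_2)' \Sigma_{xx}^{-1} (\mu_1 - \mu_2)$ .

The total covariance matrix among the predictors can be broken up into a
within-groups covariance matrix, $\Sigma$, and a between-groups covariance
matrix,

(A.9)   $\Sigma_{xx} = \Sigma + q_1 q_2 (\mu_1 - \mu_2)(\mu_1 - \mu_2)'$ .

Substituting (A.9) into (A.8) yields

(A.10)  $\rho_p^2 = q_1 q_2 (\mu_1 - \mu_2)' [\Sigma + q_1 q_2 (\mu_1 - \mu_2)(\mu_1 - \mu_2)']^{-1} (\mu_1 - \mu_2)$
        $\phantom{\rho_p^2} = q_1 q_2 (\mu_1 - \mu_2)' [\Sigma (I + q_1 q_2 \Sigma^{-1} (\mu_1 - \mu_2)(\mu_1 - \mu_2)')]^{-1} (\mu_1 - \mu_2)$
        $\phantom{\rho_p^2} = q_1 q_2 (\mu_1 - \mu_2)' [I + q_1 q_2 \Sigma^{-1} (\mu_1 - \mu_2)(\mu_1 - \mu_2)']^{-1} \Sigma^{-1} (\mu_1 - \mu_2)$ .

At this point, it is necessary to use the known matrix algebra relation
(Kshirsagar, 1972),

(A.11)  $(I + LM)^{-1} = I - L (I + ML)^{-1} M$ .

Let

(A.12)  $L = q_1 q_2 \Sigma^{-1} (\mu_1 - \mu_2)$

and

(A.13)  $M = (\mu_1 - \mu_2)'$ .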
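Then the relationship in (A.11) enables us to rewrite (A.10) as

(A.14)  $\rho_p^2 = q_1 q_2 (\mu_1 - \mu_2)' \left[ I - \dfrac{q_1 q_2 \Sigma^{-1} (\mu_1 - \mu_2)(\mu_1 - \mu_2)'}{1 + q_1 q_2 (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2)} \right] \Sigma^{-1} (\mu_1 - \mu_2)$
        $\phantom{\rho_p^2} = q_1 q_2 (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2) - \dfrac{(q_1 q_2)^2 (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2)(\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2)}{1 + q_1 q_2 (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2)}$ .

By definition, the Mahalanobis generalized distance between $K_1$ and $K_2$
equals

(A.15)  $\delta_p^2 = (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2)$ .

When (A.15) is substituted into (A.14), the result is

(A.16)  $\rho_p^2 = q_1 q_2 \delta_p^2 - (q_1 q_2 \delta_p^2)^2 / (1 + q_1 q_2 \delta_p^2)$ .

Thus,

(A.17)  $\rho_p^2 (1 + q_1 q_2 \delta_p^2) = q_1 q_2 \delta_p^2$ ,

and by simple rearrangement of terms, one obtains

(A.18)  $\delta_p^2 = \dfrac{\rho_p^2}{(1 - \rho_p^2)(q_1 q_2)}$ .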
APPENDIX B

AN ESTIMATOR OF THE SHRUNKEN GENERALIZED DISTANCE

Minimum Actual Error Rate

The minimum actual error rate associated with a sample classification
rule is the minimum probability of misclassification in the population
associated with discrimination along the dimension defined by the sample
discriminant weights. The minimum actual error rate is the minimum
possible error rate associated with using a sample classification rule in
the population and can be expressed as a function of the "shrunken"
Mahalanobis distance,

(B.1)   $\min(E_c) = q_1 \Phi\!\left[ \dfrac{\ln(q_2/q_1) - .5\,D_c^2}{D_c} \right] + q_2 \Phi\!\left[ \dfrac{-\ln(q_2/q_1) - .5\,D_c^2}{D_c} \right]$ .

(The derivation of (B.1), which assumes equal costs of misclassification,
appears in Appendix C.) In the balance of this appendix, an estimator of
$\mathcal{E}(D_c^2)$ is developed. When substituted into (B.1) for $D_c^2$,
this estimator can be used to approximate $\min(E_c)$, which in turn serves
as an approximation to $E_c$, the actual error rate.
An Estimator of $\mathcal{E}(D_c^2)$ in Terms of $\delta_p^2$

If $D_c^2$ and the "relative sizes" of the two subpopulations, $q_1$ and
$q_2$, are known, (B.1) could be used as a lower bound for the actual error
rate associated with the use of a sample classification rule in the
population. Unfortunately, $D_c^2$ is expressed in terms of the population
parameters $\mu_1$, $\mu_2$, and $\Sigma$, and these quantities usually are
unknown. (If these parameters were known, concern about the actual error
rate would be unnecessary since the optimal classification rule would be
knowable.) An estimator for $D_c^2$ in terms of sample quantities is
needed. In this section, the transformational invariance of $\delta_p^2$,
the distribution for sample discriminant weights $l_s$, and a logic
paralleling that used by Drasgow, Dorans and Tucker (1979) are combined to
obtain an estimator of $\mathcal{E}(D_c^2)$ which can be used to estimate
$E_c$.

Since the population generalized distance $\delta_p^2$ is invariant with
respect to nonsingular transformations (Lachenbruch, 1975), it is possible
to transform any r-dimensional space into an orthogonal orientation, in
which the first dimension is parallel to the line passing through the
subpopulation centroids and where each dimension has unit variance, without
affecting $\delta_p^2$. In addition, $\delta_p^2$ is invariant to
translations. These permissible transformations and translations can be
applied to any two multivariate normal subpopulations with a common
$\Sigma$ and centroids $\mu_1$ and $\mu_2$ to obtain a new covariance
matrix $\tilde{\Sigma}$ and new centroids $\tilde{\mu}_1$ and
$\tilde{\mu}_2$ having special forms

(B.2)   $\tilde{\Sigma} = I$

and

(B.3)   $(\tilde{\mu}_1 - \tilde{\mu}_2)' = (\delta_p, 0, \ldots, 0)$ .

The following derivation of the estimator for $\mathcal{E}(D_c^2)$ uses the
convenience of working with two multivariate normal subpopulations having
parameters $\tilde{\Sigma}$, $\tilde{\mu}_1$, and $\tilde{\mu}_2$. This
derivation, however, is not specific to this special type of population.
Its applicability to any arbitrary pair of multivariate normal
subpopulations is guaranteed by the invariance of $\delta_p^2$ to
transformation and translation.
The general expression for $D_c^2$ in (23) reduces to

(B.4)   $D_c^2 = \dfrac{\delta_p^2 \, l_{s1}^2}{l_s' l_s}$

for subpopulations having the parameters $\tilde{\Sigma}$, $\tilde{\mu}_1$,
and $\tilde{\mu}_2$ in (B.2) and (B.3). The term $l_{s1}$ is the first
element of $l_s$, and $l_{s1}^2$ is the first element in the sum

(B.5)   $l_s' l_s = l_{s1}^2 + l_{s2}^2 + \ldots + l_{si}^2 + \ldots + l_{sr}^2$ .

Over random samples of size N, with group sample sizes $N_1$ and $N_2$, the
expected value of $D_c^2$ is

(B.6)   $\mathcal{E}(D_c^2) = \mathcal{E}\!\left[ \dfrac{\delta_p^2 \, l_{s1}^2}{l_s' l_s} \right]$ ,

which can be estimated by

(B.7)   $Est(\mathcal{E}(D_c^2)) = \dfrac{\delta_p^2 \, \mathcal{E}(l_{s1}^2)}{\mathcal{E}(l_s' l_s)}$ .



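In order to proceed further, expressions for the quantities
$\mathcal{E}(l_{s1}^2)$ and $\mathcal{E}(l_s' l_s)$ are needed.
Fortunately, Kshirsagar (1972) has derived an expression for
$(N-2)^{-2} \mathcal{E}(l_s l_s')$,

(B.8)   $(N-2)^{-2} \mathcal{E}(l_s l_s') = G^{-1} \left\{ [(N-3) + N_1 N_2 N^{-1} \delta_p^2] \, \Sigma^{-1} + (N-r-3) \, N_1 N_2 N^{-1} \, \Sigma^{-1} (\mu_1 - \mu_2)(\mu_1 - \mu_2)' \Sigma^{-1} \right\}$ ,

where

(B.9)   $G = N_1 N_2 (N-r-2)(N-r-3)(N-r-4) / N$ .

For subpopulations having parameters $\tilde{\Sigma}$, $\tilde{\mu}_1$, and
$\tilde{\mu}_2$, a simplified expression for $\mathcal{E}(l_s l_s')$ can be
obtained from (B.8),

(B.10)  $\mathcal{E}(l_s l_s') = (N-2)^2 \, G^{-1} \left\{ [(N-3) + N_1 N_2 N^{-1} \delta_p^2] \, I + N_1 N_2 N^{-1} (N-r-3) \, \Delta_p \right\}$ ,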
where $\Delta_p$ is an r-by-r singular matrix containing $\delta_p^2$ as
its first element and zeros elsewhere.

The relationship (Kshirsagar, 1972)

(B.11)  $\mathcal{E}(l_s' l_s) = \mathrm{trace}\,[\mathcal{E}(l_s l_s')]$

can be used to obtain

(B.12)  $\mathcal{E}(l_{s1}^2) = (N-2)^2 \, G^{-1} \, [(N-3) + N_1 N_2 N^{-1} (N-r-2) \, \delta_p^2]$

and

(B.13)  $\mathcal{E}(l_s' l_s) = (N-2)^2 \, G^{-1} \, (N-3) \, [r + N_1 N_2 N^{-1} \delta_p^2]$ .

Substituting (B.12) and (B.13) into (B.7) yields

(B.14)  $Est(\mathcal{E}(D_c^2)) = \dfrac{\delta_p^2 \, [(N-3) + N_1 N_2 N^{-1} (N-r-2) \, \delta_p^2]}{(N-3) \, [r + N_1 N_2 N^{-1} \delta_p^2]}$ ,

an estimate for $\mathcal{E}(D_c^2)$ in terms of $\delta_p^2$, the
population generalized distance. (Note that when r = 1,
$Est(\mathcal{E}(D_c^2))$ equals $\delta_p^2$. This result is not
surprising. When the original subpopulations are unidimensional, a sample
discriminant function merely rescales the original dimension and the
standardized distance between population centroids along that dimension
remains invariant.)
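The step from the moment expressions to the closed form can be checked with simple arithmetic. In the sketch below, the diagonal of the expected cross-product matrix is built for the standardized population (identity covariance, centroid difference on the first axis), omitting the common scalar factor because it cancels in the ratio; the function names and test values are hypothetical.

```python
def est_from_moments(delta2, N1, N2, r):
    """Ratio form (B.7), using the diagonal of the expected l_s l_s' matrix."""
    N = N1 + N2
    c = N1 * N2 / N
    # Diagonal elements without the common factor (N-2)^2 / G,
    # which cancels when the ratio is formed.
    diag = [(N - 3) + c * delta2 for _ in range(r)]
    diag[0] += c * (N - r - 3) * delta2      # only the first element involves Delta_p
    return delta2 * diag[0] / sum(diag)      # first element over the trace


def est_closed_form(delta2, N1, N2, r):
    """Closed form (B.14)."""
    N = N1 + N2
    c = N1 * N2 / N
    return delta2 * ((N - 3) + c * (N - r - 2) * delta2) / ((N - 3) * (r + c * delta2))


for args in [(1.5, 20, 20, 1), (1.5, 20, 20, 4), (0.7, 15, 30, 6)]:
    print(args, est_from_moments(*args), est_closed_form(*args))
```

The two functions agree for every input, and with r = 1 both return the input distance unchanged.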



An Estimator for $\mathcal{E}(D_c^2)$ in Terms of Sample Quantities

For (B.14) to be usable, an estimator for the term $\delta_p^2$ is needed.
Either $(N-r-3)\,[D_s^2 - (N-2) N r / (N_1 N_2 (N-r-3))] / (N-2)$ or
$[D_s^2 (N-r-3) / (N-2)]$ can be used. The former is an unbiased estimator,
but Lachenbruch and Mickey (1968) used the latter in the DS-method to avoid
negative estimates of the population generalized distance. Once either term
is used in (B.14), the resulting estimate of $\mathcal{E}(D_c^2)$ can be
substituted into (B.1) for $D_c^2$ to estimate $\min(E_c)$, which can serve
as a lower bound estimate of $E_c$. The mnemonics given this procedure of
estimating $E_c$ through an estimate of $\min(E_c)$ are the MS-method when
the biased estimate of $\delta_p^2$ is used, and the MU-method when the
unbiased estimate of $\delta_p^2$ is used.
APPENDIX C

PROOF THAT $\min(E_c)$ IS A FUNCTION OF $D_c^2$

The claim has been made that the minimum actual error rate associated
with use of a sample classification rule in the population is a function of
the "shrunken" Mahalanobis distance between subpopulation centroids along
the dimension defined by the sample discriminant function. This claim is
proved here for the case where costs of misclassification are assumed to be
equal.

The Marginal Distribution Along $l_s$

Recall that the two subpopulations $K_1$ and $K_2$ follow multivariate
normal density functions with centroids $\mu_1$ and $\mu_2$ and a common
covariance matrix $\Sigma$. It is well known that a nonsingular
transformation of a multivariate normal population will produce transformed
variates which also follow a multivariate normal distribution. Another
property of multivariate normal populations is that the marginal density
functions are univariate normal with means and variances obtained by taking
the appropriate components of $\mu$ and $\Sigma$ (Anderson, 1958).

The sample discriminant weights $l_s'$ can be viewed as one row of a
nonsingular r-by-r transformation matrix which reorients the reference
frame in the r-dimensional space. The remaining (r-1) rows of the
transformation are chosen such that the dimensions they produce are
mutually orthogonal and orthogonal to the dimension formed by the
discriminant weights $l_s'$. (These r new dimensions are statistically
independent by virtue of their normality.) When the two subpopulation
centroids $\mu_1$ and $\mu_2$ are projected onto the dimension defined by
$l_s'$ and accompanied by the translation,
$-l_s' (\bar{x}_1 + \bar{x}_2)/2$, the means

(C.1)   $\mu_1^* = l_s' [\mu_1 - .5(\bar{x}_1 + \bar{x}_2)]$

and

(C.2)   $\mu_2^* = l_s' [\mu_2 - .5(\bar{x}_1 + \bar{x}_2)]$

are the end result. Within each subpopulation, the variance along this
dimension is

(C.3)   $V_w = l_s' \Sigma \, l_s$ .

The scores within each subpopulation along this dimension are distributed
normally with mean $\mu_k^*$ and variance $V_w$. In formal notation, we say
that for each subpopulation k the density function for $x^*$ is

(C.4)   $f_k(x^*) = (2 \pi V_w)^{-.5} \exp\!\left[ -.5 \left( \dfrac{x^* - \mu_k^*}{V_w^{.5}} \right)^{2} \right]$ .

Minimum Actual Error Rate Associated with a Sample Classification Rule

The minimum actual error rate associated with a sample classification
rule is equal to the minimum total probability of misclassification along
the dimension defined by the sample discriminant weights $l_s$. To obtain
the minimum total probability of misclassification along this dimension, an
optimal classification rule along this dimension is needed; that is, the
cutoff score must be adjusted. Welch's (1939) solution is as applicable to
classification along this single dimension defined by $l_s$ as it is to
classification in the original r-dimensional space.




General Solution

Let $f_k(x)$ be the density function of x if it comes from subpopulation
$K_k$. Let $q_k$ be the proportion of the total population that is in
subpopulation $K_k$. Assign x to $K_1$ if x is in some region $R_1$ and to
$K_2$ if x is in a region $R_2$. Assume that $R_1$ and $R_2$ are mutually
exclusive and that their union includes the entire space R. The total
probability of misclassification is

(C.5)   $E_p = q_1 \int_{R_2} f_1(x)\,dx + q_2 \int_{R_1} f_2(x)\,dx$
        $\phantom{E_p} = q_1 \left[ \int f_1(x)\,dx - \int_{R_1} f_1(x)\,dx \right] + q_2 \int_{R_1} f_2(x)\,dx$
        $\phantom{E_p} = q_1 \int f_1(x)\,dx + \int_{R_1} [q_2 f_2(x) - q_1 f_1(x)]\,dx$ .

In order to minimize $E_p$, $R_1$ should be chosen such that

(C.6)   $q_2 f_2(x) - q_1 f_1(x) < 0$ .

Thus the classification rule is to assign x to $K_1$ if
$f_1(x)/f_2(x) > q_2/q_1$ and to $K_2$ if $f_1(x)/f_2(x) < q_2/q_1$. If
$f_1(x)/f_2(x) = q_2/q_1$, it is a tossup.

Optimal Classification Along the Dimension Defined by $l_s$

The density functions for scores on the dimension defined by the sample
discriminant weights are defined in (C.4). Thus the ratio of $f_1(x^*)$ to
$f_2(x^*)$ is

(C.7)   $\dfrac{f_1(x^*)}{f_2(x^*)} = \dfrac{(2 \pi V_w)^{-.5} \exp[-.5 (x^* - \mu_1^*)^2 / V_w]}{(2 \pi V_w)^{-.5} \exp[-.5 (x^* - \mu_2^*)^2 / V_w]}$
        $= \exp[-.5 (x^{*2} - 2 x^* \mu_1^* + \mu_1^{*2}) / V_w + .5 (x^{*2} - 2 x^* \mu_2^* + \mu_2^{*2}) / V_w]$
        $= \exp[x^* (\mu_1^* - \mu_2^*) / V_w - .5 (\mu_1^* - \mu_2^*)(\mu_1^* + \mu_2^*) / V_w]$
        $= \exp[(x^* - .5 (\mu_1^* + \mu_2^*))(\mu_1^* - \mu_2^*) / V_w]$ .
Taking the natural logarithm yields the optimal classification rule, which
is to assign $x^*$ to $K_1$ if

(C.8)   $W_p(x^*) = [x^* - .5 (\mu_1^* + \mu_2^*)][(\mu_1^* - \mu_2^*) / V_w] \geq \ln(q_2/q_1)$

and to $K_2$ if

(C.9)   $W_p(x^*) < \ln(q_2/q_1)$ .

Note that $W_p(x^*)$ has the standard form of an optimal classification
rule: the scores $x^*$ are multiplied by a scaling factor
$(\mu_1^* - \mu_2^*)/V_w$, which is the univariate expression for the
coefficients of Fisher's discriminant function, and then this score is
adjusted by subtracting the additive constant $(\mu_1^* + \mu_2^*)/2$.
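The equivalence between the log density ratio and the linear statistic can be confirmed directly. The projected means, the common variance and the test score below are hypothetical values chosen for illustration; the function names are not from the paper.

```python
import math
from statistics import NormalDist

mu1_star, mu2_star, V_w = 0.8, -0.4, 1.5     # hypothetical projected means, variance
f1 = NormalDist(mu1_star, math.sqrt(V_w))    # density of x* in K1, as in (C.4)
f2 = NormalDist(mu2_star, math.sqrt(V_w))    # density of x* in K2, as in (C.4)


def W_p(x_star):
    """Linear classification statistic on the left-hand side of (C.8)."""
    return (x_star - 0.5 * (mu1_star + mu2_star)) * (mu1_star - mu2_star) / V_w


def assign(x_star, q1, q2):
    """Return 1 or 2 according to rules (C.8) and (C.9)."""
    return 1 if W_p(x_star) >= math.log(q2 / q1) else 2


x = 0.3
log_ratio = math.log(f1.pdf(x) / f2.pdf(x))  # log of the ratio in (C.7)
print(log_ratio, W_p(x))
print(assign(x, 0.5, 0.5), assign(x, 0.1, 0.9))
```

The same score is assigned to the first subpopulation under equal sizes but to the second once that subpopulation is much larger, which shows how the cutoff shifts with the relative sizes.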

Since $W_p(x^*)$ follows a univariate normal distribution, we are able to
calculate the optimal error rate (in this space of reduced dimensionality)
by using the cumulative normal distribution function $\Phi(z)$. The
probability of a member of $K_1$ being misclassified by $W_p(x^*)$ is

(C.10)  $P_1 = \mathrm{Prob}[(W_p(x^*) < \ln(q_2/q_1)) \mid K_1]$ .

To use the cumulative normal distribution function, scores on $W_p(x^*)$
must be standardized. Let $z_1^*$ equal the scores $W_p(x^*)$ standardized
in the metric of $K_1$,

(C.11)  $z_1^* = (W_p(x^*) - W_p(\mu_1^*)) / (V_w^*)^{.5}$ ,

where $V_w^*$ is the variance of the $W_p(x^*)$ scores. The mean
$W_p(\mu_1^*)$ equals

(C.12)  $W_p(\mu_1^*) = [\mu_1^* - .5 (\mu_1^* + \mu_2^*)][(\mu_1^* - \mu_2^*) / V_w]$
        $\phantom{W_p(\mu_1^*)} = .5 (\mu_1^* - \mu_2^*)[(\mu_1^* - \mu_2^*) / V_w]$
        $\phantom{W_p(\mu_1^*)} = .5 (\mu_1^* - \mu_2^*)^2 / V_w$ .
When the relationships in (C.1) - (C.3) are substituted into (C.12),

(C.13)  $W_p(\mu_1^*) = .5 \, \dfrac{[l_s' (\mu_1 - \mu_2)]^2}{l_s' \Sigma \, l_s} = .5 \, D_c^2$ ,

it reduces to the "shrunken" Mahalanobis distance divided by two. An
analogous derivation for $W_p(\mu_2^*)$ yields

(C.14)  $W_p(\mu_2^*) = -.5 \, D_c^2$ .

The variance of $W_p(x^*)$ in either subpopulation, $K_1$ or $K_2$, can be
expressed as

(C.15)  $V_w^* = (\mu_1^* - \mu_2^*) \, V_w^{-1} \, (\mu_1^* - \mu_2^*) = D_c^2$ .

In terms of the standardized variable $z_1^*$, $P_1$ can be expressed as

(C.16) $P_1 = \mathrm{Prob}[z_1^* < (\ln(q_2/q_1) - .5D_c^2)/D_c]$,

which can be rewritten as

(C.17) $P_1 = \Phi[(\ln(q_2/q_1) - .5D_c^2)/D_c]$,

where $\Phi(z_0)$ is the cumulative distribution function of a normal variable
with mean zero and variance unity evaluated at $z_0$, i.e.,

(C.18) $\Phi(z_0) = \int_{-\infty}^{z_0} (2\pi)^{-1/2} \exp[-.5z^2]\,dz$.

To determine the probability of misclassifying a member of $K_2$, an
analogous derivation is followed, yielding

(C.19) $P_2 = \Phi[-(\ln(q_2/q_1) + .5D_c^2)/D_c]$
       $= 1 - \Phi[(\ln(q_2/q_1) + .5D_c^2)/D_c]$.

The total probability of misclassification can then be expressed as

(C.20) $q_1\,\Phi[(\ln(q_2/q_1) - .5D_c^2)/D_c] + q_2\,\Phi[-(\ln(q_2/q_1) + .5D_c^2)/D_c] = \min(E_c)$.

It has just been demonstrated that the minimum actual error rate
associated with use of the sample classification rule in the population is a
function of the "shrunken" Mahalanobis distance between subpopulation
centroids projected along this dimension.
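
The error rates in (C.17), (C.19) and (C.20) can be evaluated numerically.
The following is a sketch only, assuming the shrunken distance $D_c^2$ and the
prior $q_1$ are already known; the names `phi` and `error_rates` are
hypothetical, and the standard normal distribution function is obtained from
the error function in Python's `math` module.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function, as in (C.18)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def error_rates(d2_c, q1):
    """Return (P_1, P_2, total) for shrunken distance d2_c and prior q1."""
    q2 = 1.0 - q1
    d_c = math.sqrt(d2_c)
    cutoff = math.log(q2 / q1)                   # ln(q_2 / q_1)
    p1 = phi((cutoff - 0.5 * d2_c) / d_c)        # (C.17)
    p2 = phi(-(cutoff + 0.5 * d2_c) / d_c)       # (C.19)
    return p1, p2, q1 * p1 + q2 * p2             # prior-weighted total, (C.20)


# Equal priors and D_c^2 = 4 give Phi(-1), about 0.1587, for every rate.
print(error_rates(4.0, 0.5))
```

Because the total depends on the data only through $D_c^2$ and the priors, a
larger shrunken distance always gives a smaller minimum error rate for fixed
priors.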

Relationship Between $W_s(\mathbf{x})$ and $W_p(x^*)$

The rule $W_p(x^*)$ is the optimal classification rule along the dimension
defined by the sample discriminant weights. It is interesting to compare
$W_p(x^*)$ to the sample classification rule $W_s(\mathbf{x})$. To make this
comparison possible, it is desirable to express $W_p(x^*)$ in terms of the
original variables $\mathbf{x}$. Using the relationships in (C.1) - (C.3),
(C.8) can be rewritten as

(C.21) $W_p(x^*) = [\mathbf{l}_s'(\mathbf{x} - .5(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2))]\,[s]$,

where $s$ is the scaling constant

(C.22) $s = \mathbf{l}_s'(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)/(\mathbf{l}_s'\boldsymbol{\Sigma}\,\mathbf{l}_s) = \sigma_y^2 D_c^2/(\boldsymbol{\sigma}_{yx}\mathbf{l}_s)$,



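where $\sigma_y^2$ is the variance of the criterion of group membership and
$\boldsymbol{\sigma}_{yx}$ is the 1-by-r vector of covariances between group
membership and the r predictor variables. Let's define $\mathbf{d}_s$ as the
difference in sample centroids and $\mathbf{d}_p$ as the difference in
population centroids, such that

(C.23) $\mathbf{d}_s - \mathbf{d}_p = [(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]$.

This expression allows us to rewrite (C.21) as

(C.24) $W_p(x^*) = [\mathbf{l}_s'(\mathbf{x} - .5(\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2)) + .5\,\mathbf{l}_s'(\mathbf{d}_s - \mathbf{d}_p)]\,[s]$,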

which expresses $W_p(x^*)$ as a function of $W_s(\mathbf{x})$, the sample
classification rule. Note that the two rules differ by a constant
$\mathbf{l}_s'(\mathbf{d}_s - \mathbf{d}_p)/2$ and a scaling factor $s$. When
the subpopulations $K_1$ and $K_2$ are of equal size, the only important
difference between $W_s(\mathbf{x})$ and $W_p(x^*)$ is the additive constant,
which is a function of how well the sample centroid difference,
$\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2$, approximates
$\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$, the population centroid difference.
To the extent that this approximation is good, the actual error rate
associated with $W_s(\mathbf{x})$ will approach the minimum actual error rate
associated with $W_p(x^*)$. For example, if
$\mathbf{d}_s - \mathbf{d}_p = \mathbf{0}$, then $W_s(\mathbf{x})$ is merely a
rescaling of $W_p(x^*)$ by a multiplicative constant $1/s$, and the
standardized version of $W_s(\mathbf{x})$ is identical to the standardized
version of $W_p(x^*)$.



APPENDIX D



RELATIONSHIP BETWEEN $R_c^2$ AND $D_c^2$



Consider applying sample discriminant weights $\mathbf{l}_s$ to the population
scores and computing the correlation between the binary criterion variable
of group membership and the scores obtained by applying $\mathbf{l}_s$ to the
r predictors. The sample discriminant weights, $\mathbf{l}_s$, are

(D.1) $\mathbf{l}_s = \mathbf{C}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$,

where $\mathbf{C}$ is the pooled within groups covariance matrix and
$\bar{\mathbf{x}}_1$ and $\bar{\mathbf{x}}_2$ are the centroids in samples from
subpopulations $K_1$ and $K_2$, respectively. The squared correlation (or
cross-validity coefficient) between the binary criterion and the predictions
based on $\mathbf{l}_s'$ is

(D.2) $R_c^2 = (\mathbf{l}_s'\boldsymbol{\sigma}_{xy})^2/(\sigma_y^2\,\mathbf{l}_s'\boldsymbol{\Sigma}_{xx}\mathbf{l}_s)$,

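which, due to the binary nature of the criterion Y, can be rewritten as

(D.3) $R_c^2 = \dfrac{[(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\,q_1 q_2]^2}{q_1 q_2\,(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}\boldsymbol{\Sigma}_{xx}\mathbf{C}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)}$

      $= \dfrac{q_1 q_2\,[(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]^2}{(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}\boldsymbol{\Sigma}_{xx}\mathbf{C}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)}$.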

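Note from (A.9) that $\boldsymbol{\Sigma}_{xx}$ can be expressed as the sum of
a within-groups and a between-groups covariance matrix. Thus, (D.3) can be
rewritten as

(D.4) $R_c^2 = \dfrac{q_1 q_2\,[(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]^2}{(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}[\boldsymbol{\Sigma} + q_1 q_2(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)']\mathbf{C}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)}$

      $= \dfrac{q_1 q_2\,[(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]^2}{(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}\boldsymbol{\Sigma}\mathbf{C}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) + (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}q_1 q_2(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\mathbf{C}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)}$.
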
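Note the two scalar quantities

(D.5) $(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\mathbf{C}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$

are equal to each other and define the difference between centroids
projected onto the discriminant dimension defined by $\mathbf{l}_s$. Hence,
(D.4) can be rewritten as

(D.6) $R_c^2 = \dfrac{q_1 q_2\,[(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]^2}{(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}\boldsymbol{\Sigma}\mathbf{C}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) + q_1 q_2\,[(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]^2}$.
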
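Rearranging terms in (D.6) yields

(D.7) $R_c^2\,[(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}\boldsymbol{\Sigma}\mathbf{C}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)] = q_1 q_2\,[(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]^2\,(1 - R_c^2)$

and

(D.8) $\dfrac{R_c^2}{q_1 q_2\,(1 - R_c^2)} = \dfrac{[(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]^2}{(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}^{-1}\boldsymbol{\Sigma}\mathbf{C}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)}$.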



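Upon noting the definition of $\mathbf{l}_s'$ in (D.1), (D.8) can be rewritten
as

(D.9) $\dfrac{R_c^2}{q_1 q_2\,(1 - R_c^2)} = \dfrac{[\mathbf{l}_s'(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]^2}{\mathbf{l}_s'\boldsymbol{\Sigma}\,\mathbf{l}_s}$.
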

The expression on the right hand side of the equality is the "shrunken"
Mahalanobis distance between population centroids along the dimension
defined by $\mathbf{l}_s'$. Hence, it has been shown that

(D.10) $D_c^2 = \dfrac{R_c^2}{q_1 q_2\,(1 - R_c^2)}$.
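
The relationship between the cross-validity coefficient and the shrunken
distance, $D_c^2 = R_c^2/[q_1 q_2(1 - R_c^2)]$, can be checked numerically. The
sketch below is an illustration under stated assumptions rather than part of
the paper: the helper names, the two-predictor weight vector, the within-groups
covariance matrix and the centroid difference are all made-up values. It
computes $R_c^2$ from (D.2) with $\boldsymbol{\Sigma}_{xx}$ formed as the
within-groups matrix plus the between-groups term, and compares the result
with the shrunken distance of (D.9).

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def quad_form(u, m):
    """u' M u for a square matrix M given as a list of rows."""
    return dot(u, [dot(row, u) for row in m])


def shrunken_distance(l_s, sigma, delta):
    """Right hand side of (D.9): [l_s'(mu_1 - mu_2)]^2 / (l_s' Sigma l_s)."""
    return dot(l_s, delta) ** 2 / quad_form(l_s, sigma)


def cross_validity(l_s, sigma, delta, q1):
    """R_c^2 of (D.2) for a binary criterion with priors q1 and 1 - q1."""
    q2 = 1.0 - q1
    r = len(delta)
    # Total covariance: within-groups part plus between-groups part.
    sigma_xx = [[sigma[i][j] + q1 * q2 * delta[i] * delta[j]
                 for j in range(r)] for i in range(r)]
    sigma_xy = [q1 * q2 * d for d in delta]      # predictor-criterion covariances
    sigma_y2 = q1 * q2                           # variance of the binary criterion
    return dot(l_s, sigma_xy) ** 2 / (sigma_y2 * quad_form(l_s, sigma_xx))


def distance_from_correlation(r2_c, q1):
    """D_c^2 recovered from R_c^2 as in (D.10)."""
    return r2_c / (q1 * (1.0 - q1) * (1.0 - r2_c))


# Hypothetical inputs: weights, within-groups covariance, centroid difference.
l_s = [1.0, 0.5]
sigma = [[1.0, 0.3], [0.3, 2.0]]
delta = [1.0, -0.5]
r2_c = cross_validity(l_s, sigma, delta, 0.3)
print(shrunken_distance(l_s, sigma, delta), distance_from_correlation(r2_c, 0.3))
```

The two printed values agree for any choice of weights, which reflects the
fact that the derivation never uses the particular form of $\mathbf{l}_s$
beyond its role as a fixed projection vector.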